Exploration of white wine quality by Sai Venkat Kotha

The dataset explored in this analysis contains quality ratings for several white wines. Each instance contains a quality rating for the wine between 0 (very bad) and 10 (very excellent), along with information about various chemical properties of the wine.

Structure of the data

## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

The dataset contains 4898 instances and 12 variables. Eleven of the variables describe chemical properties of the wine, and the variable ‘quality’ provides a rating for the wine.

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000
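The two listings above are the kind of output that R's `str()` and `summary()` functions produce. A minimal sketch of those calls follows. Because the path to the original data file is not given, it rebuilds the first five rows of a few columns from the values printed in the structure listing; the name `wine_sample` is an assumption made for illustration.

```r
# First five rows of a few columns, copied from the str() listing above.
# `wine_sample` is a hypothetical name; the full dataset has 4898 rows.
wine_sample <- data.frame(
  fixed.acidity  = c(7, 6.3, 8.1, 7.2, 7.2),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  density        = c(1.001, 0.994, 0.995, 0.996, 0.996),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality        = c(6L, 6L, 6L, 6L, 6L)
)

str(wine_sample)      # column types and leading values
summary(wine_sample)  # min, quartiles, mean and max for each column
```

On the full data frame the same two calls would print all 12 variables.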

Exploring individual variables

The distribution of the ‘fixed.acidity’ variable looks approximately normal. Most of the wines have a fixed.acidity value between 6 and 7.5, and very few wines have a fixed acidity less than 5 or more than 9.

From the above histogram it can be noticed that the value of ‘volatile.acidity’ is a decimal usually between 0 and 1. Most of the wines have a volatile.acidity value of 0.2 or 0.3, and very few wines have a volatile.acidity value of more than 0.5.

About half of the wines have a citric acid amount of around 0.3 or less (the median is 0.32), and about 75% of the wines have a citric acid amount less than 0.4. There are also a few outliers where the amount of citric acid is more than 1.

The distribution of the ‘residual.sugar’ variable looks skewed, with almost half of the wines having a residual.sugar value between 0 and 5. About 75% of the wines have a residual.sugar value less than 10, and there are a few outliers whose residual.sugar value is more than 20.

I transformed the skewed distribution of ‘residual.sugar’ using log10 to achieve a better distribution. The transformed distribution looks bimodal, with peaks at around 3 and 9.
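The log10 transformation described above can be sketched in base R as follows. This is only an illustration under stated assumptions: it uses the first ten residual.sugar values from the structure listing rather than the full dataset, and it uses base `hist()`, which may not be the plotting function used for the original figure.

```r
# First ten residual.sugar values, copied from the str() listing.
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# The log10 transform compresses the long right tail of the distribution.
log_sugar <- log10(residual_sugar)

# Compute the histogram bins first, then draw them.
h <- hist(log_sugar, plot = FALSE)
plot(h, main = "log10(residual.sugar)", xlab = "log10 of residual sugar")
```

With only ten values the shape is not meaningful; on the full data the same calls would show the transformed distribution.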

The ‘chlorides’ variable represents the amount of salt in the wine. Most of the wines have a chlorides amount of 0.04 and there are very few wines whose chlorides amount is more than 0.06.

The above histogram shows that most of the wines have a free sulfur dioxide level between 20 and 40. About 75% of the wines have free sulfur dioxide levels between 2 and 45. Few wines have a free sulfur dioxide level of more than 60.

The amount of total sulfur dioxide in wines has a wider range than free sulfur dioxide, as it is the sum of free and bound sulfur dioxide. The distribution looks approximately normal, with most of the wines having a total sulfur dioxide level between 100 and 160. The average amount of total sulfur dioxide in the wines is 138.4 mg/litre.

The lowest alcohol content in a wine is 8% by volume and the highest is around 14%. Most of the wines have an alcohol content between 9 and 10.5. About 75% of the wines have an alcohol content between 8 and 11.5.

Most of the wines received a quality rating from 5 to 7. There are very few wines with a quality rating more than 7 and there are no wines which received a rating of 10. It can also be seen that the lowest quality rating received is 3.

Analysis of individual variables

Structure of the dataset

The dataset contains 4898 instances of quality ratings received by various white wines. Each instance also contains other variables which represent the chemical factors of the wine. These variables are ‘fixed.acidity’, ‘volatile.acidity’, ‘citric.acid’, ‘residual.sugar’, ‘chlorides’, ‘free.sulfur.dioxide’, ‘total.sulfur.dioxide’, ‘density’, ‘pH’, ‘sulphates’, ‘alcohol’. All the variables other than quality are represented by decimal values.

At least 75% of the wines received a rating of 6 or lower (both the median and the third quartile are 6), so fewer than 25% of the wines received a rating greater than 6. Most of the wines have alcohol content between 9 and 11.

Main features of interest in the dataset

The main features of interest in the dataset are alcohol and quality. I would like to determine whether the amount of alcohol is related to the rating of a wine.

Other features in the dataset that might help support the investigation into the features of interest

The other features that could support the investigation are residual.sugar, density and total sulfur dioxide. The amount of residual sugar, which sweetens the wine, could have affected the rating of the wine. The amount of total sulfur dioxide is also important as it prevents the oxidation of the wine. Excess oxidation causes the wine to degrade and lose its aroma and taste.

Unusual distributions and data transformations

The amount of sugar, represented by the residual.sugar variable, had a skewed distribution. Applying a log10 transformation changed the distribution to bimodal with peaks at 3 and 9.

Analyzing relationships between two variables

Correlation coefficients

##                      fixed.acidity volatile.acidity  citric.acid
## fixed.acidity           1.00000000      -0.02269729  0.289180698
## volatile.acidity       -0.02269729       1.00000000 -0.149471811
## citric.acid             0.28918070      -0.14947181  1.000000000
## residual.sugar          0.08902070       0.06428606  0.094211624
## chlorides               0.02308564       0.07051157  0.114364448
## free.sulfur.dioxide    -0.04939586      -0.09701194  0.094077221
## total.sulfur.dioxide    0.09106976       0.08926050  0.121130798
## density                 0.26533101       0.02711385  0.149502571
## pH                     -0.42585829      -0.03191537 -0.163748211
## sulphates              -0.01714299      -0.03572815  0.062330940
## alcohol                -0.12088112       0.06771794 -0.075728730
## quality                -0.11366283      -0.19472297 -0.009209091
##                      residual.sugar   chlorides free.sulfur.dioxide
## fixed.acidity            0.08902070  0.02308564       -0.0493958591
## volatile.acidity         0.06428606  0.07051157       -0.0970119393
## citric.acid              0.09421162  0.11436445        0.0940772210
## residual.sugar           1.00000000  0.08868454        0.2990983537
## chlorides                0.08868454  1.00000000        0.1013923521
## free.sulfur.dioxide      0.29909835  0.10139235        1.0000000000
## total.sulfur.dioxide     0.40143931  0.19891030        0.6155009650
## density                  0.83896645  0.25721132        0.2942104109
## pH                      -0.19413345 -0.09043946       -0.0006177961
## sulphates               -0.02666437  0.01676288        0.0592172458
## alcohol                 -0.45063122 -0.36018871       -0.2501039415
## quality                 -0.09757683 -0.20993441        0.0081580671
##                      total.sulfur.dioxide     density            pH
## fixed.acidity                 0.091069756  0.26533101 -0.4258582910
## volatile.acidity              0.089260504  0.02711385 -0.0319153683
## citric.acid                   0.121130798  0.14950257 -0.1637482114
## residual.sugar                0.401439311  0.83896645 -0.1941334540
## chlorides                     0.198910300  0.25721132 -0.0904394560
## free.sulfur.dioxide           0.615500965  0.29421041 -0.0006177961
## total.sulfur.dioxide          1.000000000  0.52988132  0.0023209718
## density                       0.529881324  1.00000000 -0.0935914935
## pH                            0.002320972 -0.09359149  1.0000000000
## sulphates                     0.134562367  0.07449315  0.1559514973
## alcohol                      -0.448892102 -0.78013762  0.1214320987
## quality                      -0.174737218 -0.30712331  0.0994272457
##                        sulphates     alcohol      quality
## fixed.acidity        -0.01714299 -0.12088112 -0.113662831
## volatile.acidity     -0.03572815  0.06771794 -0.194722969
## citric.acid           0.06233094 -0.07572873 -0.009209091
## residual.sugar       -0.02666437 -0.45063122 -0.097576829
## chlorides             0.01676288 -0.36018871 -0.209934411
## free.sulfur.dioxide   0.05921725 -0.25010394  0.008158067
## total.sulfur.dioxide  0.13456237 -0.44889210 -0.174737218
## density               0.07449315 -0.78013762 -0.307123313
## pH                    0.15595150  0.12143210  0.099427246
## sulphates             1.00000000 -0.01743277  0.053677877
## alcohol              -0.01743277  1.00000000  0.435574715
## quality               0.05367788  0.43557472  1.000000000
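A matrix like the one above is what R's `cor()` returns for a numeric data frame (Pearson correlation by default). The sketch below is an assumption-laden illustration: it uses only the first five rows of three columns, copied from the structure listing, under the hypothetical name `wine_sample`. The quality column is left out because it is constant (all 6) in those rows, which would make its correlations undefined.

```r
# First five rows of three columns, copied from the str() listing.
# `wine_sample` is a hypothetical name used only for this sketch.
wine_sample <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  density        = c(1.001, 0.994, 0.995, 0.996, 0.996),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9)
)

# Pairwise Pearson correlation coefficients between all columns.
cor_matrix <- cor(wine_sample)
print(round(cor_matrix, 2))
```

Five rows cannot reproduce the coefficients in the full table, but the signs agree with it: residual.sugar and density move together, while alcohol and density move in opposite directions.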

The above plot indicates that the wines with higher quality have higher amounts of alcohol. The wines which received a lower quality rating from 3 to 5 have an average alcohol amount of around 10%, whereas the wines with higher quality ratings have an average alcohol amount of more than 11%.

It can be noticed that the wines with higher quality have lower density than that of the wines with lower quality. The density of almost all the wines is within the range 0.985 to 1.

The amount of residual sugar appears to be similar across all quality ratings of the wine.

The amount of total sulfur dioxide does not seem to have a strong relationship with the quality of the wine. Wines of every quality rating have total sulfur dioxide amounts in the same range, between 50 and 250.

The relationship is not very strong between residual sugar and total sulfur dioxide. There are a large number of wines with residual sugar amount less than 5 and having a total sulfur dioxide amount between 50 and 200. As the amount of residual sugar in wine increases there is a slight increase in the amount of total sulfur dioxide as well.

The relationship between residual.sugar and density looks almost linear. The wines that have higher amounts of residual sugar also have a higher density than the wines with lower amounts of residual sugar.
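One way to check a near-linear relationship like this is to fit a simple linear model. The sketch below is not part of the original analysis; it is a minimal illustration using `lm()` on the first five rows from the structure listing, under the hypothetical name `wine_sample`.

```r
# First five rows of two columns, copied from the str() listing.
# `wine_sample` is a hypothetical name used only for this sketch.
wine_sample <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  density        = c(1.001, 0.994, 0.995, 0.996, 0.996)
)

# Fit density as a linear function of residual sugar.
fit <- lm(density ~ residual.sugar, data = wine_sample)

# The slope is the change in density per unit of residual sugar.
print(coef(fit))
```

A positive slope means density rises with residual sugar, matching the positive correlation in the table. The fitted values from five rows should not be read as estimates for the full dataset.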

There is no strong relationship between residual.sugar and alcohol (the correlation is about -0.45). However, as the amount of alcohol in the wine increases, the variance in the amount of residual sugar decreases. The wines with the highest alcohol amounts had residual sugar amounts of around 10 g/litre, whereas some wines with lower alcohol amounts had residual sugar amounts of 20 g/litre.

The above plot reveals a clear negative linear relationship between alcohol and density (the correlation is about -0.78). Wines with high amounts of alcohol have a lower density than wines with low amounts of alcohol.

Analysis of bi-variate relationships

Relationships observed that involved features of interest

The quality rating received by a wine did not have a strong relationship with any other variables. Quality of a wine was compared with variables like alcohol, density, residual.sugar and total.sulfur.dioxide. Almost all the wines had the same range of values for these other variables.

To find how the mean and median values of these variables compared to the quality of wine, I have created a new variable named “quality_class”. The wines with quality rating less than or equal to 4 belong to the class “low”, the wines with quality rating 5 or 6 belong to the class “medium” and the wines with quality rating higher than 6 belong to the class “high”.
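The class boundaries described above can be expressed with R's `cut()` function. This is a sketch of one possible approach, not necessarily the code used in the original analysis. The `quality` vector below simply lists the observed rating range (3 to 9) for illustration.

```r
# One example rating for each value in the observed range (3 to 9).
quality <- 3:9

# Intervals are closed on the right by default:
# (-Inf, 4] is "low", (4, 6] is "medium", (6, Inf) is "high".
quality_class <- cut(
  quality,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("low", "medium", "high")
)

print(data.frame(quality, quality_class))
```

Applied to the full data frame, the same call on the quality column would give a factor that can be used to group the other variables.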

It was observed that the wines with higher quality had higher amounts of alcohol and lower density than the wines with lower quality. The amount of residual sugar was almost the same in all the quality classes of wine.

Relationships observed among other features

The other features explored were residual.sugar, density and alcohol. There is no strong relationship between residual.sugar and alcohol but density has strong relationships with both residual sugar and alcohol. There is a positive linear relationship between residual sugar and density and there is a negative linear relationship between alcohol and density.

Strongest relationships observed

The strongest relationships found were between residual sugar and density (correlation of about 0.84) and between alcohol and density (correlation of about -0.78).

Relationships among multiple variables

The wines that received a higher quality rating appear to have lower density at lower amounts of residual sugar when compared to the wines with lower quality ratings.

We have already noticed that wines with higher quality ratings have lower density and higher amounts of alcohol, and this plot shows the same pattern. Up to about 10% alcohol by volume, the wines with high quality have higher density than the other wines, but beyond that point they have the lowest density.

As the previous plots indicated, the wines with higher quality ratings have higher amounts of alcohol at lower amounts of residual sugar. It can also be noticed that the amount of alcohol in the wines with medium quality ratings increases suddenly at higher amounts of residual sugar.

Multivariate Analysis

Relationships observed

This part of the analysis did not reveal any new features; instead it strengthened the patterns found in the previous parts of the analysis. We have already seen that wines with high quality have higher amounts of alcohol and lower density. Surprisingly, the plot of density vs alcohol shows that wines with high quality have higher density than wines with lower quality up to 10% alcohol by volume. Above 10% alcohol by volume, their density is lower than that of the lower quality wines.


Final Plots and Summary

Plot One

Description One

This plot indicates that wines with higher quality have higher amounts of alcohol than wines with lower quality. Even though the difference in the mean amount of alcohol is not very big, it is still considerable.

Plot Two

Description Two

This plot shows a linear relationship between the amount of residual sugar and the density of the wine. The reason for this linear relationship could be that the density of wine depends on its percentage of alcohol and its sugar content.

Plot Three

Description Three

The above two plots look interesting because the findings prior to this plot indicated that the wines in the higher quality class usually have slightly higher amounts of alcohol and lower density. But this plot shows that, for a particular alcohol content, the density of the wines with a higher quality rating is higher than that of the wines with a lower quality rating. This trend is only observed at alcohol amounts between 8 and 10. At alcohol amounts greater than 10, the wines with a higher quality rating have lower density.


Reflection

The white wine dataset contains 4898 instances of quality ratings received by various wines. Each instance also contains 11 other variables based on physicochemical tests. I started the EDA of this dataset by looking at the distributions of these individual variables. All the variables except quality, which is an integer rating, had decimal values. The distribution of the residual.sugar variable was skewed, and I transformed the variable using a log10 transformation.

Most of the variables did not have strong correlation coefficients, and the only variable pairs which had some reasonable correlation were residual.sugar & alcohol, residual.sugar & density, density & alcohol, and quality & alcohol. Exploring these pairs did not reveal any strong relationships apart from residual.sugar vs density and density vs alcohol, which had linear relationships. This lack of correlation among variables caused trouble, as the plots did not reveal any clear relationships to explore further.

To determine how wines with similar quality rating compared to each other, I created a new variable called “quality_class” with values “low”, “medium” and “high”. Exploring these classes indicated that the wines that belonged to the ‘high’ quality class had lower density and high amounts of alcohol when compared to the other lower classes of wines. The amount of residual sugar was almost the same in the wines of all the classes. Dividing the wines into these classes and exploring them turned out to be a success as it helped in observing the trends in alcohol and density among the wines.

The exploratory analysis performed so far did not reveal any strong relationships with quality that could be used for building a predictive model. The reason for this could be that important characteristics of wine are missing from the dataset. The analysis could be improved if the dataset also had other characteristics, such as the type of soil in which the grapes were grown, the species of grapes used and the climate in which the wine was made.